Abstract

Significant  progress  was  made  in  a  number  of  aspects  of  stochastic 
systems.  The  problem  of  adaptive  control  of  priority  assignment  in  queueing 
systems  was  solved.  A  distance-measures  approach  to  the  problem  of 
approximation  and  identification  of  queueing  systems  was  studied.  A  problem 
of  adaptively  controlling  a  discounted-reward  finite-state  Markov  decision 
process  was  solved.  Major  new  results  were  obtained  for  the  problem  of 
adaptive control with incomplete observations. In particular, we have
studied in depth the case in which the state is a finite state Markov
process.


I.  SUMMARY  OF  RESEARCH  PROGRESS  AND  RESULTS 


During  the  first  year  of  research  supported  by  this  grant,  we  have 
begun  to  make  significant  progress  in  a  number  of  the  areas  which  we 
proposed  to  investigate.  In  this  section,  we  summarize  the  progress  in 
those  areas  which  have  resulted  in  publications  during  the  past  year. 


A.  Adaptive  Stochastic  Control  with  Complete  Observations 

The  assignment  of  priorities  among  customers  (or  demands  or  tasks) 
that  arrive  to  a  service  station  (or  processor)  is  an  important  problem 
encountered  in  many  situations,  from  computer  networks  to  resource  planning; 
the  adaptive  version  of  this  problem  is  considered  in  [i].  In  the  priority 
assignment  (or  dynamic  scheduling)  problem,  a  single-server  queueing  system 
is  considered  whose  customers  are  of  K  different  classes.  Customers  of  the 
several  classes  arrive  according  to  independent  Poisson  processes  with 
(known) mean arrival rates $\lambda_i$, $i = 1,\ldots,K$, and the service
times, $S_i$, for class $i$ customers are independent and identically
distributed with unknown service rates $\theta_i = 1/m_i$, where
$m_i = E(S_i)$. The state process is $X(t) = (X_1(t),\ldots,X_K(t))$, where
$X_i(t)$ is the number of class $i$ customers in the system at time $t$, and
the action space is $A = \{0,1,\ldots,K\}$. The decision points $T_n$
($T_0 = 0$) are the epochs at which either a service is completed or a
customer arrives to find the server idle; if the action $a = i \in A$ is
chosen, then the next customer to be served is of class $i$ if
$1 \le i \le K$, and $a = 0$ when the server chooses to be idle. A holding
cost $c_i > 0$ is incurred for each unit of time that a class $i$ customer
stays in the system, so that a cost rate
$k(x,a) = c_1 x_1 + \cdots + c_K x_K$ is incurred until the next transition
occurs. Thus the expected cost is $c(x,a) = k(x,a)\,\tau(x,a)$.

Under the condition that the service time distributions have finite second
moment and that the total traffic intensity
$\rho = \lambda_1 m_1 + \cdots + \lambda_K m_K$ satisfies the stability
requirement that $\rho < 1$, it can be shown that in the class of
nonpreemptive work-conserving policies, an optimal stationary policy is
the well-known "$c\theta$-rule" that ranks the classes so that
$c_1\theta_1 \ge \cdots \ge c_K\theta_K$.

Note that the $c\theta$-rule does not depend on the arrival rates.
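
The index rule just described can be illustrated with a short sketch. The
function name `c_theta_rule` and the numerical values are illustrative
assumptions, not taken from [i]; action 0 (idle) is returned only when the
system is empty, in keeping with the work-conserving restriction, and ties
are broken in favor of the lower class number.

```python
from typing import Sequence


def c_theta_rule(x: Sequence[int], c: Sequence[float], theta: Sequence[float]) -> int:
    """Return the class (1..K) to serve next, or 0 if the system is empty.

    x[i-1]     : number of class-i customers present
    c[i-1]     : holding cost rate of class i
    theta[i-1] : service rate of class i (theta_i = 1/m_i)
    """
    waiting = [i for i in range(len(x)) if x[i] > 0]
    if not waiting:
        return 0  # nothing to serve: the server idles
    # Serve the nonempty class with the largest index c_i * theta_i
    # (max returns the first maximizer, so ties go to the lower class).
    best = max(waiting, key=lambda i: c[i] * theta[i])
    return best + 1


# Hypothetical three-class example; the indices c_i * theta_i are 3.0, 4.0, 3.0.
c = [1.0, 4.0, 2.0]
theta = [3.0, 1.0, 1.5]
print(c_theta_rule([2, 1, 5], c, theta))  # class 2 has the largest index
print(c_theta_rule([2, 0, 5], c, theta))  # classes 1 and 3 tie; class 1 is served
print(c_theta_rule([0, 0, 0], c, theta))  # empty system: idle
```

The arrival rates never enter the computation, which is the point made in
the text.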

It  should  be  noted  that,  strictly  speaking,  the  priority  assignment 
problem  is  not  included  in  the  class  of  decision  processes  discussed  above, 
because (under a stationary policy, like the $c\theta$-rule above) the
process $X(t)$ is not semi-Markov; the process can have jumps (due to new
arrivals) between two consecutive decision points. However, if we view a
"transition" as taking place only at the decision points defined above,
namely, if instead of $X(t)$ we consider the process $X'(t) = X(T_n)$ for
$T_n \le t < T_{n+1}$, then $X'(t)$ is semi-Markov. The important
observation for our purposes, though, is that $X(t)$ itself is a
semi-regenerative process with embedded Markov chain $X_n = X(T_n)$,
$n = 0,1,\ldots$, and that, under the stability assumption $\rho < 1$, the
processes $X(t)$, $X'(t)$ and $X_n$ all have the same limiting behavior,
that is, the same limiting distribution. In summary, the moral is that our
adaptive
control  scheme  can  be  applied  to  more  general  problems  provided  that  they 
can  be  reduced  to  equivalent  semi-Markov  decision  problems. 

With respect to the parameter estimation, we note that, since the unknown
parameters $\theta_i = 1/m_i$ ($1 \le i \le K$) are given in terms of the
mean values $m_i$, the natural strongly consistent estimates to choose in
Step II are $\hat\theta_{i,n} = 1/\hat m_{i,n}$, $n = 1,2,\ldots$, where
$\hat m_{i,n}$ are the sample mean (or first moment) estimates of the
$m_i$. Their strong consistency follows from the law of large numbers, and
from it we can immediately deduce Step III: as $n \to \infty$,
$f(x,\hat\theta_n) \to f(x,\theta_0)$ a.s., for any state $x$, where $f$
denotes the $c\theta$-rule, and $\hat\theta_n$, $\theta_0$ are the vectors
of parameter estimates and true service rates, respectively.
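
A minimal sketch of the sample-mean estimates and the resulting plug-in
ranking follows. The class `ServiceRateEstimator`, the helper `ranking`,
the `default` value used before any data arrive, and the service times are
all hypothetical choices made for illustration; [i] should be consulted for
the actual scheme.

```python
from typing import List, Sequence


class ServiceRateEstimator:
    """Running sample mean of service times; theta_hat_i = 1 / (sample mean)."""

    def __init__(self, num_classes: int) -> None:
        self.count = [0] * num_classes
        self.total = [0.0] * num_classes

    def observe(self, cls: int, service_time: float) -> None:
        """Record one completed service time for class `cls` (1-based)."""
        self.count[cls - 1] += 1
        self.total[cls - 1] += service_time

    def theta_hat(self, default: float = 1.0) -> List[float]:
        """Return count / total for each class; `default` (an assumption) before data."""
        return [
            self.count[i] / self.total[i] if self.total[i] > 0 else default
            for i in range(len(self.count))
        ]


def ranking(c: Sequence[float], theta: Sequence[float]) -> List[int]:
    """Classes (1-based) sorted by decreasing index c_i * theta_i."""
    return sorted(range(1, len(c) + 1), key=lambda i: -c[i - 1] * theta[i - 1])


# Hypothetical data: class 1 has sample mean 1.0, class 2 has sample mean 0.25.
est = ServiceRateEstimator(num_classes=2)
for cls, s in [(1, 0.5), (1, 1.5), (2, 0.25), (2, 0.25)]:
    est.observe(cls, s)
theta_hat = est.theta_hat()
print(theta_hat, ranking([3.0, 1.0], theta_hat))
```

Because the ranking depends on the estimates only through the ordering of
the products, it stops changing once the estimates are close enough to the
true rates, provided the true indices are distinct.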


Notice  that,  because  of  the  particular  form  of  this  problem  and  the 
relationship  between  the  observations  and  the  unknown  parameter,  strongly 
consistent  estimates  are  obtained  from  the  easily  computable  sample  mean; 
thus  the  modification  of  maximum  likelihood  proposed  by  Kumar  and  the 
strong  hypotheses  of  other  papers  are  unnecessary.  Finally,  Step  IV,  that 
is, the optimality of the adaptive $c\theta$-rule, is verified in [i].

In  [ii],  we  have  considered  general  discounted-reward  finite  state 
Markov  decision  processes  which  depend  on  unknown  parameters.  An  adaptive 
policy  inspired  by  the  nonstationary  value  iteration  (NVI)  scheme  of 
Federgruen  and  Schweitzer  is  proposed;  this  is  a  variant  of  the  usual  method 
of  successive  approximations.  It  is  shown  that  this  adaptive  policy  is 
asymptotically discount optimal in the sense of Schäl. This NVI policy is
compared with the certainty equivalent or naive feedback control (NFC)
policy. The NFC requires computation and storage of the optimal policy
for all values of the parameter $\theta$; this represents considerable
off-line
computation  and  considerable  storage,  particularly  if  the  parameter  set  is 
not  finite.  On  the  other  hand,  the  NVI  policy  requires  more  on-line 
computation. 
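
The on-line character of an NVI-type policy can be sketched as follows: at
each step, one dynamic-programming sweep is applied using the model
evaluated at the current parameter estimate, and the greedy action is read
off. This is a simplified illustration under stated assumptions, not the
policy of [ii]: the two-state model, the rewards, the discount factor and
the deterministic sequence standing in for a consistent estimator are all
hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Kernel = List[List[List[float]]]  # P[s][a][s2] = Pr(next state s2 | state s, action a)


def nvi_sweep(v: List[float], reward: List[List[float]], P: Kernel,
              beta: float) -> Tuple[List[float], List[int]]:
    """Apply one dynamic-programming operator; return (new values, greedy actions)."""
    new_v, greedy = [], []
    for s in range(len(v)):
        q = [reward[s][a] + beta * sum(P[s][a][s2] * v[s2] for s2 in range(len(v)))
             for a in range(len(reward[s]))]
        best = max(range(len(q)), key=lambda a: q[a])
        new_v.append(q[best])
        greedy.append(best)
    return new_v, greedy


def run_nvi(reward: List[List[float]], model: Callable[[float], Kernel],
            estimates: Sequence[float], beta: float) -> Tuple[List[float], List[int]]:
    """v_{t+1} = T(theta_hat_t) v_t: one sweep per step, each with the current estimate."""
    v = [0.0] * len(reward)
    greedy = [0] * len(reward)
    for theta_hat in estimates:
        v, greedy = nvi_sweep(v, reward, model(theta_hat), beta)
    return v, greedy


def model(theta: float) -> Kernel:
    """Hypothetical model: action a leads to state a with probability theta."""
    def row(a: int) -> List[float]:
        return [theta if s2 == a else 1.0 - theta for s2 in range(2)]
    return [[row(0), row(1)] for _ in range(2)]


reward = [[1.0, 0.0], [0.0, 2.0]]
beta, theta_true, steps = 0.9, 0.7, 2000

# Stand-in for a strongly consistent estimator: estimates converging to theta_true.
estimates = [theta_true + 0.2 / (t + 1) for t in range(steps)]
v_adaptive, policy_adaptive = run_nvi(reward, model, estimates, beta)

# Reference: ordinary value iteration with the true parameter known.
v_true, policy_true = run_nvi(reward, model, [theta_true] * steps, beta)
print(max(abs(a - b) for a, b in zip(v_adaptive, v_true)), policy_adaptive, policy_true)
```

No policy is precomputed or stored for other parameter values; the price is
one sweep over the state space per step, which is the trade-off against NFC
noted above.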

In  related  work,  we  have  considered  the  identification  and  approximation 
of  queueing  systems  in  [iii].  In  this  paper,  a  distance-measures  approach 
to  such  problems  is  taken.  This  approach  combines  ideas  from  statistical 
robustness,  information-type  measures,  and  parameter-continuity  of  stochastic 
processes.  If  one  uses  the  appropriate  distance  measure,  it  is  possible  to 
obtain  results  on  contiguity  and  asymptotic  equivalence  of  the  probability 
measures  associated  with  the  queueing  systems,  efficient  estimates,  most 
powerful  tests,  "quick"  consistency,  and  other  qualitative  information  that 
it  would  be  difficult  to  obtain  otherwise. 


B.  Adaptive  Stochastic  Control  with  Incomplete  Observations 

As  we  proposed,  we  have  begun  a  major  new  direction  of  research  involving 
adaptive  estimation  and  control  problems  for  stochastic  systems  with 
incomplete  (or  noisy)  observations  of  the  state.  We  have  already  been 
successful  in  obtaining  some  interesting  new  results;  the  first  of  these  are 
reported  in  [iv].  In  [iv],  we  consider  discounted-reward,  denumerable  state 
space,  Markov  decision  processes  (MDP's)  with  incomplete  state  information 
and depending on unknown parameters. We are specifically interested in
three  problems:  (a)  How  do  we  obtain  a  strongly  consistent  parameter 
estimation  scheme  based  on  partial  state  information?  (b)  How  do  we  find 
"good"  approximations  of  the  optimal  reward  function?  (c)  How  do  we  find 
(asymptotically)  optimal  policies,  called  below  I-policies? 

We  approach  these  problems  by  following  the  usual  procedure  in  which 
first  the  Markov  decision  process  with  incomplete  state  information  (MDP-II) 
is  transformed  into  a  Markov  decision  process  with  complete  state  information 
(MDP-I) whose state space $\Phi := P(S)$ is the space of all probability
measures on the state space $S$ of the original MDP-II. Thus, since these
two processes are equivalent, in the sense that their optimal reward
functions are equal, problems (a), (b) and (c) are then transformed into
the standard situation of a completely observed MDP-I with Polish (i.e.,
complete separable metric) state space $\Phi$. Having done this, we can
conclude the following: (i) There
exists  a  sequence  of  estimators  of  the  unknown  parameters,  which  is  strongly 
consistent  for  any  I-policy.  (ii)  A  nonstationary  value-iteration  (NVI) 
scheme  can  be  used  to  solve  both  problems  (b)  and  (c). 

Part (i) is obtained by giving conditions on the MDP-II which imply the
strong consistency of the conditional least squares estimators of Klimko
and Nelson. To obtain (ii) we extend the NVI scheme of Federgruen and
Schweitzer and the NVI adaptive policy [iv] to Markov decision processes
with Polish state and action spaces. Thus, in short, we show that results
for parameter-adaptive discounted MDP's with complete state observations
[ii] under the usual (continuity and compactness) assumptions can be
extended to partially observed MDP's with unknown parameters.

In  [v],  we  have  begun  the  investigation  of  the  adaptive  estimation 
and  control  of  finite  state  Markov  processes,  as  we  proposed.  The  state  is 
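a finite state Markov chain $x_t \in \{1,\ldots,n\}$ with primitive
transition matrix $Q$. The observation process is $y_t \in \{0,1\}$. If $Q$
is known, there is a finite dimensional recursive filter for
$p_{t+1|t} = [p^1_{t+1|t},\ldots,p^n_{t+1|t}]^T$, where
$p^i_{t+1|t} = \Pr(x_{t+1} = i \mid y_1,\ldots,y_t)$:

$$ p_{t+1|t} = \frac{S^T p_{t|t-1}}{\mathbf{1}^T S^T p_{t|t-1}}\, y_t
   + \frac{(Q-S)^T p_{t|t-1}}{1 - \mathbf{1}^T S^T p_{t|t-1}}\,(1-y_t),
   \qquad (1) $$

where $S = [s_{ij}]$, $s_{ij} = \Pr(x_{t+1} = j,\ y_t = 1 \mid x_t = i)$,
and $\mathbf{1}$ is the vector of ones. If $y_t$ and $x_{t+1}$ are
conditionally independent given $x_t$, then $S = \Gamma Q$, where
$\Gamma = \mathrm{diag}(\gamma_1,\ldots,\gamma_n)$ with
$\gamma_i = \Pr(y_t = 1 \mid x_t = i)$ and
$\gamma = (\gamma_1,\ldots,\gamma_n)^T$, and (1) can be rewritten in the
following useful ways:

$$ p_{t+1|t} = \frac{Q^T(I-\Gamma)\,p_{t|t-1}}{1-\gamma^T p_{t|t-1}}
   + \frac{Q^T\bigl(\Gamma - (\gamma^T p_{t|t-1})\,I\bigr)\,p_{t|t-1}}
          {\gamma^T p_{t|t-1}\,(1-\gamma^T p_{t|t-1})}\, y_t
   \qquad (2) $$

$$ p_{t+1|t} = \frac{Q^T(I-\Gamma)}{1-\gamma^T p_{t|t-1}}\,p_{t|t-1}\,(1-y_t)
   + \frac{Q^T\Gamma}{\gamma^T p_{t|t-1}}\,p_{t|t-1}\,y_t
   \qquad (3) $$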
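
A minimal sketch of one step of this filter, written for the case
$S = \Gamma Q$, is given below. It assumes that $\gamma_i$ denotes
$\Pr(y_t = 1 \mid x_t = i)$; the function name `predict` and the two-state
numbers are hypothetical. The code first weights the current estimate by
the likelihood of the observed $y_t$ and then propagates the normalized
result through $Q^T$, which is the same computation as the two-term form of
the filter.

```python
from typing import List


def predict(p: List[float], y: int, Q: List[List[float]], gamma: List[float]) -> List[float]:
    """One filter step: map p_{t|t-1} and the observation y_t to p_{t+1|t}."""
    n = len(p)
    # Bayes correction: weight each state by the likelihood of the observed y_t.
    w = [(gamma[i] if y == 1 else 1.0 - gamma[i]) * p[i] for i in range(n)]
    norm = sum(w)  # gamma^T p if y == 1, else 1 - gamma^T p
    if norm <= 0.0:
        raise ValueError("observation has zero probability under the current estimate")
    # Prediction: propagate the corrected vector through Q^T.
    return [sum(Q[i][j] * w[i] for i in range(n)) / norm for j in range(n)]


# Hypothetical two-state example.
Q = [[0.9, 0.1], [0.2, 0.8]]
gamma = [0.1, 0.7]
p = [0.5, 0.5]
print(predict(p, 1, Q, gamma))  # approximately [0.2875, 0.7125]
print(predict(p, 0, Q, gamma))  # approximately [0.725, 0.275]
```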


In  general,  the  adaptive  estimation  problem  involves  the  computation 
of  estimates  (e.g.,  state  estimates)  in  the  presence  of  unknown  parameters; 
in  addition,  estimates  of  the  parameters  are  often  computed  simultaneously. 

In  the  present  context,  the  adaptive  estimation  problem  is  that  of  computing 
recursive  estimates  of  the  conditional  probability  vector  when  the  transition 
matrix  Q  is  not  completely  known  (i.e.,  it  depends  on  a  vector  of  unknown 
parameters $\theta$; henceforth, we express this dependence via
$Q(\theta)$). The


approach  to  this  problem  which  we  investigate  in  [v]  has  been  widely  used 
in  linear  filtering:  we  use  the  previously  derived  recursive  filter  for  the 
conditional  probabilities,  and  we  simultaneously  recursively  estimate  the 
parameters,  plugging  the  parameter  estimates  into  the  filter.  For  example, 
for  the  filter  (3),  the  adaptive  filter  would  have  the  form: 


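$$ \epsilon_t = y_t - \gamma^T \hat p_{t|t-1} \qquad (4) $$

$$ \hat\theta_t = \hat\theta_{t-1} - a_t R_t^{-1} \psi_t \epsilon_t \qquad (5) $$

$$ \hat p_{t+1|t} = \frac{Q(\hat\theta_t)^T (I-\Gamma)}{1 - \gamma^T \hat p_{t|t-1}}\,
   \hat p_{t|t-1}\,(1-y_t)
   + \frac{Q(\hat\theta_t)^T \Gamma}{\gamma^T \hat p_{t|t-1}}\,
   \hat p_{t|t-1}\,y_t \qquad (6) $$

where $\{a_t\}$ is a sequence of positive scalars, $R_t$ is a positive
definite matrix which modifies the search direction, and $\psi_t$ is an
approximation of the gradient of $\epsilon_t$ with respect to $\theta$
(evaluated at $\hat\theta_{t-1}$). We take $R_t$ to be given by the
Gauss-Newton direction:

$$ R_t = R_{t-1} + a_t\,(\psi_t \psi_t^T - R_{t-1}). \qquad (7) $$

Also, $\psi_t$ is obtained by deriving an equation for
$\partial\epsilon_t(\theta)/\partial\theta$ (for a fixed $\theta$), and
then evaluating at $\theta = \hat\theta_{t-1}$; thus

$$ \partial\epsilon_t/\partial\theta
   = -\gamma^T\,\partial p_{t|t-1}/\partial\theta
   = -\gamma^T \zeta(t). \qquad (8) $$

Equations for $\zeta(t)$ (and for $\hat\zeta(t)$, obtained by substituting
$\hat\theta_t$ for $\theta$ in the $\zeta(t)$ equations) are derived.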
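
The parameter and Gauss-Newton updates can be sketched for a scalar
parameter as follows. This is an illustration only: it assumes the sign
convention in which $\psi_t$ is the gradient of the prediction error
$\epsilon_t$, so that the estimate moves against $\psi_t \epsilon_t$; the
function name `gauss_newton_step` and the numbers are hypothetical, and the
computation of $\psi_t$ itself (which requires the derivative equations of
[v]) is not shown.

```python
from typing import Tuple


def gauss_newton_step(theta: float, R: float, psi: float, eps: float,
                      a: float) -> Tuple[float, float]:
    """One scalar update of (theta_hat, R) given psi_t, eps_t and the gain a_t.

    R_t     = R_{t-1} + a_t * (psi_t**2 - R_{t-1})
    theta_t = theta_{t-1} - a_t * psi_t * eps_t / R_t
    (psi_t is assumed to be the gradient of eps_t with respect to theta.)
    """
    R_new = R + a * (psi * psi - R)
    if R_new <= 0.0:
        raise ValueError("R must stay positive for the step to be well defined")
    theta_new = theta - a * psi * eps / R_new
    return theta_new, R_new


# Hypothetical numbers: R becomes 2.5 and theta moves from 0.5 to about 0.46.
theta_new, R_new = gauss_newton_step(theta=0.5, R=1.0, psi=2.0, eps=0.1, a=0.5)
print(theta_new, R_new)
```

With $0 < a_t \le 1$ and $R_{t-1} > 0$ the new $R_t$ is a convex combination
of $R_{t-1}$ and $\psi_t^2$, so it remains positive.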

These computations give rise to a recursive stochastic algorithm of the
general form

$$ \eta_{k+1} = \eta_k + a_k F(\eta_k, \xi_k), \qquad (9) $$

where $\eta_k = (\hat\theta_k, R_k)$ and the driving process $\xi_k$
includes the filter estimate $\hat p_{k|k-1}$. We follow the approach of
Kushner to the Ordinary Differential Equation (ODE) Method of analyzing
(9). That is, we define $t_n = \sum_{i=0}^{n-1} a_i$ and suppose that
$t_n \to \infty$ as $n \to \infty$. Define the piecewise-constant
interpolated process $\eta^0(\cdot)$ by $\eta^0(t) = \eta_k$ on
$[t_k, t_{k+1})$, and its shifts $\eta^n(\cdot) = \eta^0(t_n + \cdot)$.
The idea is to show weak convergence of the sequence $\{\eta^n(\cdot)\}$
to the solution of an ODE, which can then be used to conclude properties
(such as convergence as $t \to \infty$) of the parameter estimates
$\hat\theta_k$. The essential assumption is that $\xi_k$ depends on
$\eta_k$ in such a way that if $\eta_k \equiv \eta$, a constant, then
$\{\xi_k\}$ has a unique invariant (or stationary) measure. In [v], we
show that it does indeed have a unique invariant measure.
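
The time scale used in the ODE method can be made concrete with a short
sketch. It assumes the usual construction in which the $k$-th iterate is
held constant on the interval $[t_k, t_{k+1})$, with $t_k$ the sum of the
first $k$ gains; the function names, the gains $a_k = 1/(k+1)$ and the
iterate values are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import List, Sequence


def interpolation_times(gains: Sequence[float]) -> List[float]:
    """Return [t_0, t_1, ..., t_n] with t_0 = 0 and t_k = a_0 + ... + a_{k-1}."""
    return [0.0] + list(accumulate(gains))


def interpolate(times: Sequence[float], iterates: Sequence[float], t: float) -> float:
    """Piecewise-constant interpolation: the value eta_k on [t_k, t_{k+1})."""
    k = bisect_right(times, t) - 1
    # Clamp so that times outside the computed range use the nearest iterate.
    k = min(max(k, 0), len(iterates) - 1)
    return iterates[k]


# Hypothetical gains a_k = 1/(k+1); their partial sums grow without bound.
gains = [1.0 / (k + 1) for k in range(4)]
times = interpolation_times(gains)      # t_0 = 0, t_1 = 1, t_2 = 1.5, ...
iterates = [10.0, 11.0, 12.0, 13.0]     # stand-ins for eta_0, ..., eta_3
print(times)
print([interpolate(times, iterates, t) for t in (0.0, 1.2, 1.5, 5.0)])
```

On this time scale later iterates occupy shorter intervals, so the
interpolated path moves by small increments per unit time, which is what
allows comparison with the solution of a differential equation.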